Abstract

• With the increasing use of massively parallel sequencing approaches in
evolutionary biology, the need for fast and accurate methods suitable to
investigate genetic structure and evolutionary history is greater than ever.
We propose new distance measures for estimating genetic distances between
individuals when allelic variation, gene dosage and recombination could
compromise standard approaches.

• We present four distance measures based on single nucleotide polymorphisms
(SNPs) and evaluate them against previously published measures using
coalescent-based simulations. Simulations were used to test (i) whether the
measures give unbiased and accurate distance estimates, (ii) whether they can
accurately identify the genomic mixture of hybrid individuals and (iii) whether
they give precise (low variance) estimates.

• The results showed that the SNP-based genpofad distance we propose appears
to work well in the widest range of circumstances. It was the most accurate
method for estimating genetic distances and was also relatively good at
estimating the genomic mixture of hybrid individuals.

• Our simulations provide benchmarks to compare the performance of different 
distance measures in specific situations. 

Keywords: Single nucleotide polymorphisms (SNPs), genetic distances,
polyploidy, hybridization, population genomics, coalescent, simulations.
Introduction

The last few decades have witnessed a methodological revolution in the field of popula- 
tion genetics. Model-based likelihood approaches have been propelled to the forefront 
of species and population level studies (e.g. Beaumont and Rannala 2004; Beaumont 
et al. 2002; Huelsenbeck et al. 2001). These changes have been made possible by the 
remarkable advances in computing technology and the application of computationally 
intensive Monte Carlo methodology. 

Even these sophisticated methods, however, face critical challenges from the
overwhelming amount of data generated by massively parallel sequencing
technologies. State-of-the-art models and methods often cannot accommodate
population genomics data. Consequently, fast approaches that allow patterns and
processes to be investigated still have their utility in this discipline.

Our objective is to present new, flexible, and robust distance measures for estimating
genetic distances from single nucleotide polymorphism (SNP) data. We focus on the
estimation of distances between individuals (or organisms), even though the distances 
could certainly be useful in many other circumstances. There are good reasons to focus 
at the level of individuals rather than populations or species. Individuals are central 
to biology. Measurements based on morphology, spatial positioning, or genetics are 
generally performed at the individual level. Individuals are also the fundamental units 
of natural selection, the central concept of evolutionary biology. And finally, estimates 
of genetic relatedness between individuals can reveal correlations between genetic and 
phenotypic distances, spatial genetic structure across a landscape, species boundaries, 
and could be used for genetic or phylogenetic diversity (PD) surveys. 

Although obtaining genetic distances among individuals seems relatively straight- 
forward, there can be several complicating factors. One is the presence of SNPs among 
gene copies in non-haploid individuals. Polyploidy, which is defined by the presence 
of more than two genome copies in a nucleus, leads to further complexities. Not only 
is there the potential presence of more than two character states for each nucleotide, 
there is also the potential for non-conventional segregation of chromosomes. Finally, 
recombination along chromosomes renders the problem of calculating distances between 
organisms even more complex. Given the importance of estimating genetic distances 
between individuals and the increasing availability of genome-wide sequence data, we 
think that this issue deserves further investigation. 

Only a few approaches, generally motivated by very different research questions,
have been proposed to handle SNPs, polyploidy or recombination. Although not based
on sequence data, Bruvo et al. (2004) proposed an interesting approach to deal with
ploidy level variation when estimating the distances between individuals from
microsatellite data, and this approach could be generalized to sequence data. Their
method consisted of directly comparing the alleles of one individual with those of
another, while accounting for the "missing alleles" in comparisons between ploidy
levels. Joly and Bruneau (2006) proposed the pofad algorithm to estimate the genetic
distance between individuals from allelic sequence information. Their idea for
comparing homozygotes and heterozygotes could be seen as comparing alleles that share
a most recent common ancestor. However, their implementation could not be applied to
polyploid organisms. Later, Göker and Grimm (2008) proposed different methods to
estimate distances between "populations" using, among others, community ecology
statistics such as Shannon's entropy or Euclidean distances. Although not originally
designed for the problem we address here, they could nevertheless be relevant if one
considers an individual as a "population" of sequences. Their approaches could be
applied to individuals of mixed ploidy levels, but they did not deal with the
potential presence of recombination.

Here, we propose four methods for estimating genetic distances between individuals 
from nucleotide sequence data. One of these is an adaptation of Nei's genetic distance 
(Nei et al. 1983) for this specific problem, but the three other methods are novel. All 
methods are very general in that they can be applied to individuals of any ploidy 
level, but also when individuals have different ploidy levels. We first describe in detail 
the challenges involved in estimating genetic distances between individuals. We then 
describe the new methods and compare them and others using simulations. We finish 
by making recommendations on the use of distance measures in different contexts. 

Problems associated with the estimation of distances 
between individuals 

Allelic variation 

Although the estimation of genetic distances between DNA sequences is straightforward,
the potential presence of more than one allele at autosomal loci in non-haploid
individuals makes it more complex to estimate the genetic distances between
individuals, especially when combining information from multiple loci (Joly and
Bruneau 2006). Also, one important property of distances that measure overall
difference between individuals is that the comparison of a heterozygous individual
with itself should have a distance of 0, something that is not necessarily obtained
with all existing approaches. For instance, taking the mean pairwise distance between
all alleles will not generally give a distance of 0 when comparing an individual with
itself.
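The self-comparison problem can be illustrated with a minimal sketch. It assumes alleles are given as aligned strings of equal length and uses the proportion of differing sites (p-distance) between alleles; the function name `mean_pairwise_distance` is hypothetical and not part of any published software.

```python
from itertools import product


def mean_pairwise_distance(alleles_x, alleles_y):
    """Mean p-distance over all pairs of alleles, one taken from each individual.

    Assumes every allele is an aligned string of the same length.
    """
    def p_distance(a, b):
        # Proportion of aligned sites at which the two alleles differ.
        return sum(c1 != c2 for c1, c2 in zip(a, b)) / len(a)

    pairs = list(product(alleles_x, alleles_y))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


heterozygote = ["ACGT", "ACGA"]  # two alleles differing at one of four sites
homozygote = ["ACGT", "ACGT"]

print(mean_pairwise_distance(heterozygote, heterozygote))  # non-zero self-distance
print(mean_pairwise_distance(homozygote, homozygote))      # zero self-distance
```

The heterozygote obtains a positive distance to itself because each allele is also compared with the other allele of the same individual.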

Polyploidy 

Polyploidy brings two other problematic issues: inheritance and gene dosage. Inheri- 
tance of diploids is always disomic while it can be either disomic or multisomic in poly- 
ploids (Comai 2005). Polyploids are disomic if chromosomes group by pairs at meiosis, 
one example being homeologous chromosomes in allopolyploids. However, they are mul- 
tisomic when chromosomes form multivalents. In many cases, inheritance of polyploid 
taxa is unknown or difficult to determine precisely. Some polyploids are even charac- 
terized by a mixture of inheritance modes. For instance, a marker could have mainly 
disomic inheritance with occasional multisomic inheritance, or different chromosomes 
could have different modes of inheritance within a genome (Wendel 2000). 

Gene dosage is another issue associated with polyploidy (Bruvo et al. 2004). In 
diploids, gene dosage is obvious: a homozygous individual has two copies of the same 
allele and a heterozygous individual has one copy of each allele. In polyploids, it is
rare that we know the exact dosage of each allele in the genome. A tetraploid that has
the observed nucleotide state 'A' at a position (i.e., it is homozygous) can only have
genotype 'AAAA'. However, a tetraploid individual with observed states 'A' and 'T' at
a site could have the genotypes 'ATTT', 'AATT', or 'AAAT'. The unknown dosage of
these character states makes it more difficult to estimate precisely the genetic distances 
between polyploids. The situation can become even more complicated when there are 
more than two character states at a sequence site, a feature that becomes more likely in 
higher polyploids. Finally, another important feature of the desired distance measure is 
the capacity to estimate distances between individuals of different ploidy levels (Bruvo 
et al. 2004). 
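As a small illustration of the dosage ambiguity described above, the following sketch enumerates the unordered genotypes that are compatible with a set of observed nucleotide states at a given ploidy. The function name `compatible_genotypes` and the string representation of genotypes are assumptions made for this example only.

```python
from itertools import combinations_with_replacement


def compatible_genotypes(observed_states, ploidy):
    """List the unordered genotypes in which every observed state occurs at least once."""
    states = sorted(set(observed_states))
    if len(states) > ploidy:
        raise ValueError("more observed states than gene copies")
    return [
        "".join(genotype)
        for genotype in combinations_with_replacement(states, ploidy)
        if set(genotype) == set(states)
    ]


print(compatible_genotypes("A", 4))    # a homozygous tetraploid has a single genotype
print(compatible_genotypes("AT", 4))   # three possible dosages for states A and T
print(len(compatible_genotypes("ACT", 4)))
```

The number of compatible genotypes grows with ploidy and with the number of observed states, which is why measures that do not require dosage are attractive.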



Distance definitions 

We propose four new distance measures to calculate the genetic distance between in- 
dividuals from sequence data. The main novelty of these proposed measures is that 
they are all computed at the nucleotide level. Therefore, we define them first at the 
individual nucleotide site level, and explain later how these distances can be extended 
to strings of nucleotides, some potentially linked (within loci) and others unlinked. 
These measures assume that we know the nucleotides present at a given position in 
an individual but not necessarily gene dosage, which is typical for data obtained from 
genotyping or sequencing. All proposed distances are bounded between 0 and 1 and 
have the property that the distance between an individual and itself is 0. 



matchstates 

This measure looks at each nucleotide present at a given sequence site in one individual 
and checks if there is a nucleotide in the other individual that matches. More formally, 
consider a specific sequence site i that might be present in multiple alleles or gene copies 
in an individual. Let $A^i_X$ be the complete set of nucleotides for individual X at
site i and let $|A^i_X|$ be the number of nucleotide states observed for individual X
at site i. The matchstates distance between individual X and individual Y at site i is

$$\mathrm{MATCHSTATES}^i_{XY} := \frac{|A^i_X \,\triangle\, A^i_Y|}{|A^i_X| + |A^i_Y|},$$

where $A^i_X \triangle A^i_Y$ denotes the set of elements that belong to either
$A^i_X$ or $A^i_Y$, but not to both.
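The per-site definition can be sketched as follows, assuming that the observed nucleotide states of each individual at a site are passed as a string or any iterable of characters. The function name `matchstates` is chosen for this illustration and does not reproduce the POFAD or SplitsTree4 code.

```python
def matchstates(states_x, states_y):
    """Per-site matchstates distance: |X symmetric-difference Y| / (|X| + |Y|)."""
    x, y = set(states_x), set(states_y)
    if not x or not y:
        raise ValueError("each individual needs at least one observed state")
    # The ^ operator on sets gives the symmetric difference.
    return len(x ^ y) / (len(x) + len(y))


print(matchstates("AT", "AT"))  # identical heterozygotes
print(matchstates("AT", "A"))   # one unmatched state out of three
print(matchstates("A", "T"))    # no state in common
```

Because only the set of states is used, the result does not depend on ploidy or on gene dosage.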



genpofad 

The genpofad measure is named after the pofad algorithm described by Joly and 
Bruneau (2006). The genpofad distance can be defined as one minus the number of
nucleotides shared between two individuals divided by the maximum number of
nucleotides observed in either of the individuals at a given sequence site. Following
the notation introduced above,

$$\mathrm{GENPOFAD}^i_{XY} := 1 - \frac{|A^i_X \cap A^i_Y|}{\max(|A^i_X|, |A^i_Y|)}.$$
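A minimal per-site sketch of this definition is given below, under the assumption that each individual is represented by the set of nucleotide states observed at the site. The function name `genpofad` is used here for illustration only and is not taken from the POFAD source code.

```python
def genpofad(states_x, states_y):
    """Per-site genpofad distance: 1 - |X intersection Y| / max(|X|, |Y|)."""
    x, y = set(states_x), set(states_y)
    if not x or not y:
        raise ValueError("each individual needs at least one observed state")
    return 1 - len(x & y) / max(len(x), len(y))


print(genpofad("AT", "AT"))   # identical heterozygotes
print(genpofad("AT", "A"))    # heterozygote versus homozygote sharing one state
print(genpofad("A", "T"))     # no state in common
print(genpofad("ACT", "AG"))  # one shared state, three states in the richer individual
```

In the diploid case, a heterozygote compared with a homozygote that shares one of its states lies halfway between identity and complete difference.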



mrca 



The mrca distance measure gives a distance of 0 whenever two individuals share at 
least one nucleotide at a given site and a distance of 1 otherwise. Formally, the mrca 
distance between individual X and individual Y is 



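$$\mathrm{MRCA}^i_{XY} := \begin{cases} 0 & \text{if } |A^i_X \cap A^i_Y| > 0, \\ 1 & \text{if } |A^i_X \cap A^i_Y| = 0. \end{cases}$$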
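In code, the mrca rule reduces to a test for a non-empty intersection. The sketch below assumes the same set-of-states input as the other per-site measures; the function name `mrca` is illustrative.

```python
def mrca(states_x, states_y):
    """Per-site mrca distance: 0 if the individuals share a state, 1 otherwise."""
    x, y = set(states_x), set(states_y)
    if not x or not y:
        raise ValueError("each individual needs at least one observed state")
    return 0.0 if x & y else 1.0


print(mrca("AT", "A"))    # one shared state is enough for a distance of 0
print(mrca("AT", "CG"))   # no shared state
```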



nei 

This distance is the application of Nei's genetic distance (Nei et al. 1983) at the
nucleotide level. The frequency of each nucleotide is estimated per site for each
individual and the nei genetic distance between individual X and individual Y for
site i is then estimated as

$$\mathrm{NEI}^i_{XY} := 1 - \sum_{j \in \{A,C,T,G\}} \sqrt{P^i_{jX} P^i_{jY}},$$

where $P^i_{jX}$ is the frequency of nucleotide j in individual X at site i. This
formula is flexible as it can easily be applied among individuals of different ploidy
levels. Gene dosage is assumed to be known, but the distance can also be used when
dosage is unknown by giving equal weight to each nucleotide present.
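The following sketch illustrates the nei measure at one site. It assumes that the distance takes the D_A form of Nei et al. (1983), that is, one minus the sum over nucleotides of the square root of the product of the two frequencies; this form, the function names, and the genotype-string input are assumptions of the example. When `dosage_known` is false, each observed nucleotide receives equal weight, as described above.

```python
from collections import Counter
from math import sqrt


def nucleotide_frequencies(genotype, dosage_known=True):
    """Frequencies of each nucleotide among the gene copies at one site."""
    # Without dosage information, every observed state is counted once.
    copies = list(genotype) if dosage_known else sorted(set(genotype))
    counts = Counter(copies)
    return {base: n / len(copies) for base, n in counts.items()}


def nei(genotype_x, genotype_y, dosage_known=True):
    """Per-site distance 1 - sum_j sqrt(p_jX * p_jY) (assumed D_A form)."""
    px = nucleotide_frequencies(genotype_x, dosage_known)
    py = nucleotide_frequencies(genotype_y, dosage_known)
    shared = sum(sqrt(px[b] * py[b]) for b in px.keys() & py.keys())
    return max(0.0, 1.0 - shared)  # guard against tiny negative rounding errors


print(nei("AATT", "AAAT"))                    # tetraploids with known dosage
print(nei("AAAT", "AT", dosage_known=False))  # mixed ploidy, dosage ignored
```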



Extension to multiple sites and genes 

The extension of all distance measures to many sites within a locus is easily done by
taking the average distance over all DNA positions,

$$d_{XY} = \frac{1}{s} \sum_{i=1}^{s} d^i_{XY},$$

where s is the number of sites and $d^i_{XY}$ is the contribution of site i to the
distance. An estimate of the variance of this mean, whose square root is the standard
error, is then provided by the standard statistical formula

$$\mathrm{var}(d_{XY}) = \frac{1}{s(s-1)} \sum_{i=1}^{s} \left(d^i_{XY} - d_{XY}\right)^2.$$
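The averaging over sites and the associated variance can be sketched as follows. The input is assumed to be a list of per-site distances already computed with one of the measures above, and the function name `locus_distance` is hypothetical. The variance returned is that of the mean, so its square root is the standard error.

```python
from math import sqrt


def locus_distance(site_distances):
    """Mean distance over s sites and the variance of that mean."""
    s = len(site_distances)
    if s < 2:
        raise ValueError("at least two sites are needed to estimate the variance")
    mean = sum(site_distances) / s
    variance = sum((d - mean) ** 2 for d in site_distances) / (s * (s - 1))
    return mean, variance


mean, variance = locus_distance([0.0, 0.0, 0.5, 1.0])
print(mean, variance, sqrt(variance))  # distance, variance of the mean, standard error
```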

In some cases, it might be important to divide nucleotides into different loci, such 
as when several unlinked genes are sampled throughout the genome, each containing 
several linked nucleotides. We suggest distances be calculated first across sites within 
a marker to obtain distance matrices for each marker. Once this is done, one can 
compute a genome-wide distance matrix by taking the mean of all marker matrices.
In calculating this genome-wide distance matrix, it is possible to scale each individual
matrix by dividing the distances of a given matrix by the maximum distance in that
matrix. This scaling gives the same weight to all markers whatever their variability,
which could be useful if the markers do not have the same evolutionary rates (e.g.,
exons, introns, non-coding regions, etc.). If the nucleotides cannot easily be divided into
distinct loci, such as when we have a long contiguous sequence along a chromosome, the
average distance over all DNA positions is appropriate because each site is then assumed
to represent an independent assessment of the distance between the individuals.
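The combination of marker matrices can be sketched as below. The sketch assumes that every marker matrix is a square nested list with individuals in the same order; the function name `genome_wide_matrix` and the `scale` flag are hypothetical. A matrix whose maximum is 0 is left unscaled to avoid division by zero, which is a choice made for this example.

```python
def genome_wide_matrix(marker_matrices, scale=False):
    """Mean of per-marker distance matrices, optionally scaled by each matrix maximum."""
    n = len(marker_matrices[0])
    total = [[0.0] * n for _ in range(n)]
    for matrix in marker_matrices:
        factor = 1.0
        if scale:
            largest = max(max(row) for row in matrix)
            factor = largest if largest > 0 else 1.0
        for a in range(n):
            for b in range(n):
                total[a][b] += matrix[a][b] / factor
    k = len(marker_matrices)
    return [[value / k for value in row] for row in total]


fast_marker = [[0, 0.2, 0.4], [0.2, 0, 0.4], [0.4, 0.4, 0]]
slow_marker = [[0, 0.01, 0.04], [0.01, 0, 0.04], [0.04, 0.04, 0]]

print(genome_wide_matrix([fast_marker, slow_marker])[0][1])              # dominated by the fast marker
print(genome_wide_matrix([fast_marker, slow_marker], scale=True)[0][1])  # equal weight per marker
```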

Implementation 

All these algorithms are implemented in POFAD version 1.06
(www.plantevolution.org/en/pofad.html). The matchstates algorithm is also implemented
in SplitsTree4 (Huson and Bryant 2006).

Simulations 

Computer simulations were performed to compare the performance of the distances in 
different situations. We evaluated three properties of the distance measures. First, we 
tested if the measures provide an unbiased and accurate estimate of distances between 
organisms. Second, we investigated how the different distances are able to detect the 
genomic mixture of hybrid individuals. Third, we evaluated how precise these different 
measures were. We evaluated our new distance metrics along with previously
published distances of Göker and Grimm (2008) that are relevant in the present context:
the min distance and the Phylogenetic Bray-Curtis (pbc) distance. The frq and the
entropy distance measures of Göker and Grimm (2008) were not investigated because they
are not bounded between 0 and 1 and because they are more relevant in the context of
host-parasite associations for which they were originally described. Finally, we also
evaluated the recent 2ISP method (Potts et al. 2014), even though this distance is not
bounded between 0 and 1, as it is similar to our proposed methods (see Appendix for
mathematical definitions of previously published distances).

Accuracy of distance measures 

To investigate whether the distance measures were accurate for estimating distances 
between individuals, we simulated tetraploid individuals (2n = 4x) along a species tree 
using the coalescent and estimated the genetic distances between individuals that have 
been evolving for different periods of time. Gene sequences of 1000 bp were simulated 
using MCcoal (Rannala and Yang 2003) on a species tree where the individuals
compared had the following divergence times (τ): 0, 0.0005, 0.001, 0.002, 0.003, and
0.005. The divergence times (τ) represent the expected number of mutations per site
from the node in the species tree to the present time. However, the expected divergence
times of the sequences between individuals will be greater than the time of species
divergence as the time to coalescence of the sequences in the ancestral species needs
to be considered (Nei 1987; Edwards and Beerli 2000; Arbogast et al. 2002). The
expected time to coalescence in the ancestral species (population) is equal to 2N
generations, or θ/2 in mutational units (Edwards and Beerli 2000). The expected genetic
distance is thus twice the expected coalescence time, that is, twice the time since the
species divergence plus twice the expected coalescent time in the ancestral population:
d = 2τ + θ. Distance measures were thus compared to this expected sequence divergence,
but also with the expected species divergence (2τ). Simulations were performed with two
population sizes (θ = 0.001 and θ = 0.01) that were held constant throughout the tree.
The larger population size increased the number of polymorphisms in individuals. All
simulations were repeated 2000 times.
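The two reference values used to judge accuracy follow directly from these definitions. The short sketch below tabulates the expected species divergence (2τ) and the expected sequence divergence (2τ + θ) for the simulated settings; the function and constant names are illustrative only.

```python
TAUS = (0, 0.0005, 0.001, 0.002, 0.003, 0.005)  # species divergence times
THETAS = (0.001, 0.01)                          # population sizes


def expected_divergences(tau, theta):
    """Return (species divergence, sequence divergence) = (2*tau, 2*tau + theta)."""
    return 2 * tau, 2 * tau + theta


for theta in THETAS:
    for tau in TAUS:
        species, sequence = expected_divergences(tau, theta)
        print(f"theta={theta} tau={tau} species={species:.4f} sequence={sequence:.4f}")
```

With τ = 0 the expected sequence divergence equals θ, which is why within-population comparisons inform on how well a measure estimates θ.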

Estimation of the genomic mixture of hybrids 

To investigate how good the different distance measures are at detecting the genomic 
mixture of hybrid individuals, we estimated and compared the genetic distance of an 
allopolyploid with its two parents. For this, we simulated an allopolyploid speciation 
event. Gene copies inherited from one parent in the allopolyploid were then transferred 
by descent in the allopolyploid species via multisomic inheritance (i.e., they can be 
assumed to form a panmictic population and simulated with the coalescent), and were 
evolving independently from the gene copies inherited from the other parent. This 
allowed us to simulate gene sequences using multi-labeled species trees (see Jones et al. 
2013). The parental species were tetraploids whereas the allopolyploid species was
either octoploid with four gene copies coming from each parent or hexaploid with
four copies coming from one parent and two from the other. This allowed us to test
two ratios of parental genome contribution in the hybrid.

Gene sequences of 1000 bp were simulated on a species tree as described above with a
population size θ = 0.001 and with a divergence time of the two parental species fixed
at τ = 0.003. Three different scenarios were investigated for the timing of the
allopolyploid event: τ = 0 (in which case the allopolyploid is an immediate descendant
of the two parental species), τ = 0.001 or τ = 0.002. To investigate the hybrid mixture
of the allopolyploid individual, we estimated a hybrid index that indicates the
relative distance of the hybrid from its two parents:

$$I = \frac{d_{AX}}{d_{AX} + d_{BX}},$$

where A and B are the two parents and X the hybrid, and where $d_{AX}$ is the genetic
distance between species A and the hybrid. The hybrid index (I) is bounded between
0 and 1 and an index of 0.5 indicates that the hybrid is equally distant from both
parents. Cases where both $d_{AX}$ and $d_{BX}$ were equal to zero were given I = 0.5.
All simulations were repeated 2000 times.
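The hybrid index, including the convention used when both distances are zero, can be sketched as follows. The function name `hybrid_index` and the argument names are chosen for this illustration.

```python
def hybrid_index(d_ax, d_bx):
    """Relative distance of hybrid X from parent A: d_AX / (d_AX + d_BX)."""
    if d_ax < 0 or d_bx < 0:
        raise ValueError("distances must be non-negative")
    total = d_ax + d_bx
    if total == 0:
        return 0.5  # convention: equally distant when both distances are zero
    return d_ax / total


print(hybrid_index(0.2, 0.2))  # hybrid equally distant from both parents
print(hybrid_index(0.1, 0.3))  # hybrid closer to parent A
print(hybrid_index(0.0, 0.0))  # both distances zero
```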

Effect of the number of markers on precision 

We also estimated the impact of gene number on precision in the two previous simulation
settings. For the precision of the genetic distance estimate, we used the simulations
with θ = 0.001 and the expected distance of 0.01. For the hybrid index, we used
the framework of the octoploid speciation event at τ = 0.001. In both cases, we
evaluated the statistics (distance or hybrid index) with 1, 2, 5, 10, 20, and 40 markers.
Distances were estimated 100 times for each scenario and the standard deviation among
estimates was computed and plotted to investigate the decrease in standard deviation
with the number of markers for each method.

Results 

Theoretical considerations 

Before comparing the different distance methods, it is relevant to note the similarities 
between the SNP-based methods proposed here and the previously published methods 
based on whole marker sequences. For example, mrca is the same as min applied to 
a single nucleotide. As such, it is interesting to compare the performance of this pair 
of methods in the simulations. Moreover, the genpofad distance is equivalent to the 
pofad algorithm of Joly and Bruneau (2006) when applied to a single nucleotide in 
diploid individuals. For a locus evolving under an infinite site mutation model without 
recombination, the genpofad distance should give the same distance as pofad when 
extended to the whole locus (see below). However, genpofad has the advantage that 
it could be applied to individuals of any ploidy level. 

Distance accuracy 

Only genpofad provided an accurate estimate of the sequence divergence (2τ + θ;
Figs. 1, 2). The genpofad estimates were very accurate with small population sizes
(θ = 0.001), but tended to slightly underestimate the distance for small divergence
times with θ = 0.01 (Figs. 1, 2). Moreover, genpofad also underestimated sequence
divergence within populations (i.e., when species divergence = 0), suggesting that it
is not a very accurate estimator of θ. Nevertheless, it was the best estimator of θ
among the methods tested.

Other distance measures had interesting properties. min underestimated sequence
divergence (Fig. 1), but provided an accurate estimate of the species divergence (Fig.
2). matchstates and pbc provided similar estimates that fell between the expected
sequence divergence and the species divergence. The other estimates either largely
overestimated sequence divergence (2ISP, nei) or underestimated species divergence
(mrca) in all situations (Figs. 1, 2).

Hybrid genetic mixture 

Distance measures were evaluated for estimating the intermediacy of hybrid individuals
relative to their parents. When the parents contributed an equal number of gene copies,
all methods were accurate, but nei provided the most precise estimate of the hybrid
index (Fig. 3). genpofad and 2ISP were the second best methods according to precision,
followed very closely by pbc and matchstates. mrca and min provided imprecise
estimates of the hybrid index (Fig. 3).

Figure 1: Boxplots showing the estimated divergence for several distance measures,
compared to expected sequence divergence (d = 2τ + θ; dotted lines of the same colour
as the boxes). Simulations were performed on a species tree with the coalescent using
population sizes of θ = 0.001 (left panels) or θ = 0.01 (right panels).

Figure 2: Plots showing the relationship between median estimated sequence divergence for
the distance methods and the species divergence used in the simulations, for two population
sizes. The gray area indicates the range between the expected species divergence (d = 2τ;
lower bound) and the expected sequence divergence (d = 2τ + θ; upper bound).

Figure 3: Boxplots showing a hybrid index (i.e., the relative contribution of each parental
genome) for the different distance measures and for different times since the allopolyploid
(hybridization) speciation event. The gray lines indicate the genomic mixtures that were
simulated: one in which each parent contributed equally (1:1) to the allopolyploid (left
panels) and another where one parent contributed twice as many copies (2:1) as the other
parent (right panels).

No method provided an accurate hybrid index estimate when one parent contributed
twice the number of gene copies as the other (Fig. 3), but some methods performed
better than others. pbc was by far the best method, followed by genpofad and
matchstates. As before, mrca and min provided the worst estimates of the hybrid
index. Also, although some evidence for an unequal contribution was visible for the
young hybrid with genpofad and matchstates, evidence of unequal parental contribution
for older hybrids was only observed with the pbc distance.

Effect of the number of markers on precision 

Evaluation of the methods' precision showed different results for the distance accuracy
and for the hybrid index simulations. For the estimation of the genetic distance, all
methods showed a similar precision and the increase in precision (decrease in standard
deviation among replicates) was similar for the different methods, with the exception of
2ISP, which had a much larger error than all others (Fig. 4, upper panel). The pattern
was different for the precision of the hybrid index. The methods mrca and min were much
less precise than the others and they required more markers to converge on stable
estimates (Fig. 4, lower panel). The remaining methods had a similar precision, although
they could be ranked as follows for precision (from best to worst): nei > genpofad =
2ISP > matchstates > pbc (Fig. 4, lower panel).

Figure 4: Plot showing the effect of the number of markers used on the precision of the genetic
distances (upper panel) and the hybrid index estimates (lower panel). A small standard
deviation indicates better precision. The simulation settings for the genetic distance were as
for Fig. 1a with expected distance of 0.01 and for the hybrid index they were the same as
those for Fig. 3a with a medium timing for the allopolyploid speciation event.



Discussion 

With the increasing use of massively parallel sequencing approaches in evolutionary
biology, fast, accurate, and precise methods to investigate genetic structure and
evolutionary history are required. Concatenation approaches are known to be inconsistent in
some circumstances (Degnan and Rosenberg 2006; Salter Kubatko and Degnan 2007)
and fully Bayesian approaches to population/species reconstruction (e.g. Heled and
Drummond 2010; Liu et al. 2009) are computationally demanding with large numbers
of markers. Although faster coalescent alternatives exist for genomic studies (Bryant
et al. 2012), distance measures nevertheless remain an interesting strategy, especially
given the statistical consistency of some indices (Liu et al. 2009; Mossel and Roch 2010).

Until now, the toolset of distance measures was limited for studying the relationships
of individuals. Overcoming this shortcoming is critical given that individuals are
the fundamental unit for many studies at the species level. The main problems
encountered at this level are those of allelic variation and polyploidy. However, the
potential presence of recombination in the nuclear genome and the SNP-based nature of
many contemporary studies represent further challenges. We thus present here new
distance measures that are all estimated at the nucleotide level in order to alleviate
these biological complexities.

Advantages of SNP-based distances 

Interestingly, SNP-based distances compared favourably with whole-sequence
distances in our simulations. This is relevant because the simulation of long
(1000 bp) sequences without recombination should favour distances estimated on
whole sequences. On the contrary, the most accurate method for estimating genetic
distances was a SNP-based method. Clearly, one can expect SNP-based methods to
rapidly gain an advantage over whole-sequence methods in the presence of
recombination. In many empirical studies that use large numbers of markers, it is indeed
very difficult to rule out completely the presence of recombination, especially if
markers are long. Whereas recombination should not affect the performance of SNP-based
methods, it will affect those based on whole sequences. SNP-based methods are thus
expected to be particularly useful given the increasing abundance of genome-scale
studies based on whole genomes or reduced-representation sequencing data.

Another important factor to consider is the length of markers. Massively parallel
sequencing technologies generally result in markers of short sequence length. With such
data, we expect the relative advantage of distance measures based on the whole
marker sequence to decrease with decreasing sequence length. Indeed, we can get
an idea of that effect when going from 1000 bp sequences to SNP data by comparing
the distances min and mrca, as mrca is identical to min applied to a single SNP.
Consequently, SNP-based methods are particularly well suited for SNP-based studies
or for studies using short markers.

Importance of gene dosage information 

Of the methods evaluated here, two can actually take into account exact gene dosage
information if known: pbc and nei. One would expect this type of information to be
particularly important for estimating unequal genomic mixtures in hybrid individuals.
This actually seems to be the case for pbc, which was the best method according to this
criterion. However, nei did not appear to benefit from gene dosage information in the
same situation. Our results tend to show, however, that gene dosage information is not
critical for good performance in all situations. This is especially true for the estimation
of genetic distances where the best method did not use gene dosage information. This
is a very encouraging result given that such information is rarely known precisely in
genomic studies involving polyploids.

Method performances 

In terms of genetic distance accuracy, the best method was genpofad, a SNP-based
method. It provided very accurate estimates of sequence divergence at small population
sizes (θ = 0.001), even if the estimates were slightly biased at larger population sizes
(θ = 0.01). It was also found to slightly underestimate θ within populations,
even though it was still better than all other methods in this respect.

The minimum allelic distance between individuals (min) provided an accurate
estimate of the species divergence time, which is an interesting property. This observation
concurs with previous studies that have shown this measure to be a consistent estimator
of species distances in certain situations (Mossel and Roch 2010; DeGiorgio and
Degnan 2014). However, the simulations showed that this measure performs poorly when it
comes to estimating the genomic mixture of individuals, both in terms of accuracy and
precision. Interestingly, two distance measures, matchstates and pbc, provided estimates
that fell between the expected sequence divergence and the species divergence, that is,
between 2τ + θ and 2τ.

Regarding hybrid mixture estimates, the best method was clearly pbc, which was the
only method to be close to accurate when estimating unequal contribution of the
parents in the young hybrid. Moreover, evidence for unequal contribution remained
even for older hybrids, whereas that signal was lost for all other methods. Note that this
assumes that we know the exact number of copies in the hybrid (i.e., gene dosage),
information that might not always be available in empirical datasets and whose absence
could affect the performance of the pbc distance. Among other methods, genpofad and
matchstates were slightly better as they showed slight evidence for the unequal parental
contributions for the young hybrid and they provided precise estimates. The methods
min and mrca were not precise and did not detect unequal parental contributions. This
is not surprising as these methods essentially ignore polymorphisms by considering only
the most similar nucleotides (mrca) or alleles (min).

Perhaps the best recommendation we can provide is to use the genpofad distance
in general, as it is the most accurate method in terms of expected genetic distance
and is relatively good at estimating genomic mixture between individuals.
Moreover, its performance will not be affected by the presence of recombination or by
the availability of only short markers. In cases where species divergence times are of
interest and in the absence of recombination, the min distance is of great interest.
Finally, if gene dosage is known and genomic admixture is of main interest, then the
pbc distance is the best choice if recombination is absent. In any case, we hope that
this study and the simulation framework we propose for comparing the performance of
distance measures will stimulate the development and testing of further SNP-based
distance measures.

Appendix 

Definition of previously published distance measures 

In the following definitions based on whole marker sequences, $A_X$ represents the
complete set of alleles for individual X and $|A_X|$ is the number of alleles observed
for individual X. Also, let $d_{ij}$ be the genetic distance between alleles i and j.

MIN distance 

The MIN distance was proposed by Göker and Grimm (2008) in the present context,
but it has often been used in other contexts as well (e.g. Joly et al. 2009; Liu et al.
2009; Mossel and Roch 2010). It can be described as:

$$\mathrm{MIN}_{XY} := \min(d_{ij} \mid i \in A_X, j \in A_Y).$$
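A minimal sketch of the MIN distance is given below. It assumes alleles are aligned strings of equal length and uses the proportion of differing sites as the allelic distance $d_{ij}$; any other allelic distance could be substituted. The function names are illustrative.

```python
def p_distance(allele_a, allele_b):
    """Proportion of aligned sites at which two alleles differ."""
    return sum(a != b for a, b in zip(allele_a, allele_b)) / len(allele_a)


def min_distance(alleles_x, alleles_y):
    """Smallest allelic distance over all pairs of alleles from X and Y."""
    return min(p_distance(a, b) for a in alleles_x for b in alleles_y)


print(min_distance(["ACGT", "ACGA"], ["ACGA", "TCGA"]))  # a shared allele gives 0
print(min_distance(["ACGT"], ["ACCA", "TCGA"]))          # closest pair differs at two sites
```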

Phylogenetic Bray-Curtis distance (PBC) 

The PBC distance was defined by Göker and Grimm (2008) as:

$$\mathrm{PBC}_{XY} := \frac{\sum_{i \in A_X} \min(d_{ij} \mid j \in A_Y) + \sum_{j \in A_Y} \min(d_{ij} \mid i \in A_X)}{|A_X| + |A_Y|}.$$
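The PBC definition can be sketched in the same way. The example assumes aligned alleles of equal length, a p-distance between alleles, and allele lists in which a repeated allele represents known gene dosage; these choices and the function names are assumptions of the illustration.

```python
def p_distance(allele_a, allele_b):
    """Proportion of aligned sites at which two alleles differ."""
    return sum(a != b for a, b in zip(allele_a, allele_b)) / len(allele_a)


def pbc(alleles_x, alleles_y):
    """Sum of each allele's distance to its nearest allele in the other individual,
    divided by the total number of alleles."""
    nearest_from_x = sum(min(p_distance(a, b) for b in alleles_y) for a in alleles_x)
    nearest_from_y = sum(min(p_distance(a, b) for a in alleles_x) for b in alleles_y)
    return (nearest_from_x + nearest_from_y) / (len(alleles_x) + len(alleles_y))


print(pbc(["ACGT", "ACGA"], ["ACGT", "ACGA"]))  # an individual compared with itself
print(pbc(["ACGT", "ACGA"], ["ACGT"]))          # one allele of X has no exact match in Y
```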





2ISP distance 

The 2ISP distance is a nucleotide-based distance (Potts et al. 2014). It estimates the
distance between nucleotides using the step-matrix presented in Figure 1 of Potts et al.
(2014).